Mac Terminal
RStudio
Github homepage
Configure Git command line for my GitHub account:
git config --global user.name "Miguel Desmarais"
git config --global user.email "migdesmarais@email.com"
Create new portfolio directory:
mkdir MICB425_portfolio
Initialize directory as new GitHub repo:
git init
Add files into github index (staging area):
git add .
Commit files to GitHub repo:
git commit -m "First commit"
Connect newly initiated GitHub repo:
git remote add origin https://github.com/migdesmarais/MICB425_portfolio.git
Verify web URL for new remote repo:
git remote -v
Push repo up to GitHub:
git push -u origin master
git add .
git commit -m "First commit"
git push -u origin master
The following assignment is an exercise for the reproduction of this .html document using the RStudio and RMarkdown tools we’ve shown you in class. Hopefully by the end of this, you won’t feel at all the way this poor PhD student does. We’re here to help, and when it comes to R, the internet is a really valuable resource. This open-source program has all kinds of tutorials online.
http://phdcomics.com/ Comic posted 1-17-2018
The goal of this R Markdown html challenge is to give you an opportunity to play with a bunch of different RMarkdown formatting. Consider it a chance to flex your RMarkdown muscles. Your goal is to write your own RMarkdown that rebuilds this html document as close to the original as possible. So, yes, this means you get to copy my irreverent tone exactly in your own Markdowns. It’s a little window into my psyche. Enjoy =)
hint: go to the PhD Comics website to see if you can find the image above. If you can’t find that exact image, just find a comparable image from the PhD Comics website and include it in your markdown.
Let’s be honest, this header is a little arbitrary. But show me that you can reproduce headers with different levels please. This is a level 3 header, for your reference (you can most easily tell this from the table of contents).
Perhaps you’re already really confused by the whole markdown thing. Maybe you’re so confused that you’ve forgotten how to add. Never fear! A calculator R is here:
1231521+12341556280987
## [1] 1.234156e+13
Or maybe, after you’ve added those numbers, you feel like it’s about time for a table!
I’m going to leave all the guts of the coding here so you can see how libraries (R packages) are loaded into R (more on that later). It’s not terribly pretty, but it hints at how R works and how you will use it in the future. The summary function used below is a nice data exploration function that you may use in the future.
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.3
kable(summary(cars),caption="I made this table with kable in the knitr package library")
| speed | dist |
|---|---|
| Min. : 4.0 | Min. : 2.00 |
| 1st Qu.:12.0 | 1st Qu.: 26.00 |
| Median :15.0 | Median : 36.00 |
| Mean :15.4 | Mean : 42.98 |
| 3rd Qu.:19.0 | 3rd Qu.: 56.00 |
| Max. :25.0 | Max. :120.00 |
And now you’ve almost finished your first RMarkdown! Feeling excited? We are! In fact, we’re so excited that maybe we need a big finale eh? Here’s ours! Include a fun gif of your choice!
library(ggplot2)
metadata <- read.table(file="Saanich.metadata.txt", header=TRUE, row.names=1, sep="\t", na.strings="NAN")
ggplot(metadata, aes(x=NO3_uM, y=Depth_m)) +
geom_point(shape=17, size=2, colour="purple") +
scale_y_reverse() +
xlab("Nitrate (uM)") +
ylab("Depth (m)") +
ggtitle("Nitrate levels")
library(ggplot2)
library(tidyverse)
metadata <- read.table(file="Saanich.metadata.txt", header=TRUE, row.names=1, sep="\t", na.strings="NAN")
metadata %>%
mutate(Temperature_F = (Temperature_C)*(9/5)+32) %>%
ggplot() + geom_point(aes(x=Temperature_F, y=Depth_m)) +
scale_y_reverse() +
xlab("Temperature (°F)") +
ylab("Depth (m)") +
ggtitle("Temperature")
library(ggplot2)
library(tidyverse)
library(phyloseq)
load("phyloseq_object.RData")
physeq_percent = transform_sample_counts(physeq, function(x) 100 * x/sum(x))
plot_bar(physeq_percent, fill="Domain") +
geom_bar(aes(fill=Domain), stat="identity") +
xlab("Sample depth") +
ylab("% relative abundance") +
ggtitle("Domain from 10-200 m in Saanich Inlet", subtitle = NULL)
metadatag <- metadata %>%
gather(key="variable", value="value", O2_uM, PO4_uM, SiO2_uM, NO3_uM, NH4_uM, NO2_uM)
ggplot(metadatag) +
facet_wrap(~variable, scales="free_x") +
geom_path(aes(x=value, y=Depth_m)) +
scale_y_reverse() +
xlab("µM") +
ylab("Depth (m)") +
ggtitle("Nutrient concentration")
What is an estimate of the number and total carbon content of prokaryotes on Earth?
What are the main prokaryotic habitats on Earth?
Are the numbers of prokaryotes in different habitats consistent with their turnover times?
What is the total C, N and P content of prokaryotes on Earth?
What is the cellular production rate of prokaryotes on Earth and how is it related to genetic diversity?
The primary methodological approach in this paper is to review and evaluate previous literature on the topics of prokaryotic abundance, turnover time and cellular production rate in different habitats and to extract representative numbers. These numbers were averaged, compared and used to calculate hypothetical prokaryotic abundance, turnover time and cellular production rates representative of each habitat. The uncertainty of results from certain habitats was also assessed by investigating different research results.
The estimate of prokaryotic abundance on Earth is 4-6 x 10^30 cells: 1.2 x 10^29 cells in the open ocean, 2.6 x 10^29 in soil, 3.5 x 10^30 in oceanic subsurfaces and 0.25-2.5 x 10^30 in terrestrial subsurfaces. The estimates were found to be consistent with previously calculated turnover times for each habitat.
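As a quick sanity check, the habitat figures quoted above can be summed in R to see how they produce the overall range. This is only a sketch of the arithmetic; the variable names are made up for illustration and the values are the rounded ones from this summary, not the paper's full tables.

```r
# Habitat estimates (cells), as quoted in the summary above.
open_ocean <- 1.2e29
soil <- 2.6e29
oceanic_subsurface <- 3.5e30
# The terrestrial subsurface is given as a range, so keep both ends.
terrestrial_subsurface <- c(low = 0.25e30, high = 2.5e30)

# Scalars are recycled across the low/high vector.
total_cells <- open_ocean + soil + oceanic_subsurface + terrestrial_subsurface
signif(total_cells, 2)
```

The two totals come out at about 4.1 x 10^30 and 6.4 x 10^30 cells, which matches the quoted range of roughly 4-6 x 10^30.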
The total carbon content of prokaryotes on Earth is estimated to be 60-100% (350-550 Pg) of the carbon content of land plants. In addition, the prokaryotic nitrogen (85–130 Pg) and phosphorus (9–14 Pg) content is about 10-fold the amount found in land plants.
Based on their estimated population size (4-6 x 10^30 cells) and cellular production rate (1.7 x 10^30 cells/yr), the authors argue that prokaryotes have a high capacity for genetic diversity, and that this diversity may be underestimated.
The open ocean has the highest cellular productivity and therefore this population is more likely to undergo mutations and rare genetic events.
Given their high diversity and mutation rates, should we change the way we define prokaryotic species?
Do experiments in laboratory settings properly reflect events in aquatic or soil environments?
Given their high abundance, do subsurface prokaryotes play a major role in biogeochemical cycles on Earth?
Does the open ocean have a unique role in the planet given its important potential for prokaryotic diversity?
The methods were not explained adequately. Most of the methods were based on calculations (and numbers taken from the literature), which were not always clearly outlined in the text. Since some of these methods are not outlined, the results should be used cautiously. Many assumptions appear to have been made when calculating the carbon and nitrogen/phosphorus content, and thus these results must be taken as rough estimates only. Nevertheless, the conclusions were still justified. Even if the calculated numbers are not as precise as one would want them to be, they are probably still within an acceptable range of the correct values. They show the great abundance and production rate of prokaryotes on Earth and highlight their role within habitats. The tables were easy to understand and helped to convey how important prokaryotic life is on Earth.
Mouse over the data points in the below timeline to see the key events in the evolution of Earth systems.
Hadean: dry surface at 500 degrees C, 90 bar pressure on Earth’s surface; carbon dioxide tied up in carbonate materials such as limestone.
Archaean: strong greenhouse gases predominate (methane and carbon dioxide).
Proterozoic: oxidation of the early atmosphere - microaerobic.
Phanerozoic: increased oxygenation of the atmosphere, mass extinction events from meteorite impacts.
This paper aims to set planetary boundaries that should not be overstepped during the Anthropocene in order to prevent human activities from causing irreparable environmental damage/change. Questions asked: What are these planetary boundaries? How can they be quantified? Have these planetary boundaries already been overstepped?
The authors have identified and quantified planetary boundaries with a “conservative, risk-averse approach”. The planetary boundaries are values selected for control variables that sit at a reasonable distance from dangerous levels/thresholds. The methods used to do so are not clearly stated but are assumed to be based on previous literature. The current status of these boundaries was determined from previous studies on climate change and the Anthropocene and assessed in relation to our current situation.
The authors identified 9 planetary boundaries: climate change, rate of biodiversity loss, interference with the nitrogen and phosphorus cycles (counted as a single boundary), stratospheric ozone depletion, ocean acidification, global freshwater use, change in land use, atmospheric aerosol loading and chemical pollution.
Out of these 9 planetary boundaries, 3 have already been overstepped: Climate change, rate of biodiversity loss and nitrogen cycle.
The boundaries are tightly coupled and need to be addressed as a whole.
New approach for defining preconditions for human development: “…address the scale of human action in relation to the capacity of Earth to sustain it, …work on understanding essential Earth processes including human actions and …research into resilience and its links to complex dynamics and self-regulation of living systems.”
What direct actions can be done to limit the damage caused by our activities?
Is it too late to reset the nitrogen or phosphorus cycles or to limit damage on biodiversity?
How strong is the link between these 9 planetary boundaries, and is overstepping one boundary enough to push other boundaries towards their thresholds?
What feedbacks and other repercussions will come from our current actions?
Are there other potential boundaries that we are not aware of today because they are not yet in danger?
The methods and approach used in this paper were not clearly stated. A part of the article could have been dedicated to describing the process by which these planetary boundaries and thresholds were identified and set. The figures and tables were useful to get a grasp on the current state of these planetary boundaries, but still no detailed methods were shown. The article is a great summary of the planet's current environmental condition. It is quite challenging to write about this topic since many assumptions about the inter-connectivity of systems need to be made in order to set thresholds and boundaries. I believe a more quantitative or mathematically-based study would be interesting.
Aquatic - prokaryotic abundance in aquatic environments is an order of magnitude greater in the upper 200 m of the continental shelf and open ocean than in the deep ocean (>200 m). Freshwater values fall in a similar range. These numbers, of course, depend on the latitude of habitats and also on unique environmental factors. Because of animal mixing and precipitation, the upper 10 cm of sediment in the ocean is included in aquatic habitat prokaryotic numbers. Aquatic prokaryotes have a short turnover time and therefore a high cellular productivity, which means that aquatic habitats have a great capacity to support life (varying with depth), especially in the photic zone. Total abundance (cells): 1.18 x 10^29
Soil - prokaryotes are ubiquitous in soil environments. The amount of carbon input to soil habitats is high and fuels prokaryotic growth and their role in the soil decomposition subsystem. Given their carbon content, soil habitats have a high capacity for supporting life. Total abundance (cells): 2.556 x 10^29
Subsurface - subsurface environments are major habitats for prokaryotes. They are separated into marine sediments below 10 cm and terrestrial habitats below 8 m. Most of the subsurface biomass is supported by organic matter deposited from the surface. Most of the prokaryotes found in subsurface habitats are in the upper 600 m layer. Total abundance (cells): 3.8 x 10^30
Abundance in upper 200 m of the ocean (cells): 3.6 x 10^28
Density (cells/mL): 5 x 10^5
(Cyanobacteria: (4 x 10^4 cells/mL) / (5 x 10^5 cells/mL) x 100 = 8%)
Fraction represented by Cyanobacteria including Prochlorococcus: 8%
Cyanobacteria such as Prochlorococcus produce their own energy from sunlight via photosynthesis, which in the process produces oxygen while fixing carbon. Despite only being 8% of the total prokaryotic cells in the upper 200 m of the ocean, they are responsible for approximately 50% of the oxygen in the atmosphere and contribute greatly to carbon cycling, as demonstrated by their quick turnover time and resulting 8.2 x 10^29 cells/year.
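The 8% fraction quoted above is a ratio of two cell densities, and can be reproduced in a couple of lines of R. The variable names are illustrative only; the densities are the ones given in this answer.

```r
# Cell densities in the upper 200 m of the ocean (cells/mL), from the text.
total_density <- 5e5   # all prokaryotes
cyano_density <- 4e4   # Cyanobacteria, including Prochlorococcus

# Percentage of prokaryotic cells that are Cyanobacteria.
cyano_percent <- cyano_density / total_density * 100
cyano_percent
```

Because both densities share the same units (cells/mL), the units cancel and the result is a plain percentage.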
Autotrophs: prokaryotes that produce their own food (fix CO2 to organic matter), primarily using energy from the sun.
Heterotrophs: prokaryotes that consume organic carbon to produce energy and synthesize building blocks of life.
Lithotrophs: prokaryotes that use inorganic compounds as an energy source to produce building blocks of life.
The Mariana Trench (10,994 meters, 11 km) is the deepest part of the ocean, and it is an environment that supports prokaryotic life. Because subsurface sediments below the water layer also support prokaryotic life, the argument could be made that the deepest habitat to host prokaryotic life is the subsurface sediment layer of the trench. Subsurface environments on land may contain prokaryotes further below that of the Mariana Trench. However, not much is currently known about life existing below these depths due to challenges in retrieving uncontaminated samples from these areas.
Deepest habitat: Mariana Trench (10,994 meters, 11 km) + 4.5 km including subsurface sediments, depending on temperature.
Primary limiting factor: temperature. Temperature increases with depth by about 22 ˚C/km.
Prokaryotes have been found in the atmosphere at altitudes as great as 57-77 km. Mount Everest (8,848 meters, 8.8 km) is the highest geographical location on Earth, and therefore would be the highest habitat capable of supporting prokaryotic life. Is it capable of supporting prokaryotic life? Primary limiting factors at this height include temperature. Some prokaryotes, psychrophiles, have adapted to such low temperatures. Nutrients are also limited at high altitude. Fewer atoms are found in the upper atmosphere and thus less material is available to compose the building blocks of life. This would result in slower growth. UV radiation as well as pressure are limiting to life at high altitudes because they can damage cells.
Highest habitat: Mount Everest (8,848 meters, 8.8 km).
Primary limiting factor: temperature and building blocks of life
Lower limit: Mariana Trench: 10,994 meters (11 km) deep + 4.5km = 15.5 km.
Upper limit: Mount Everest: 8,848 meters (8.8 km) high, but the upper limit is much higher if it includes the atmosphere as a “habitat”.
Vertical distance of the Earth’s biosphere: 15.5 km + 8.8 km = 24.3 km (+ potential atmosphere).
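The vertical extent above is simple addition, sketched here in R with the unrounded depths and heights from the text. The variable names are assumptions for illustration, and the atmosphere is deliberately left out, as in the estimate above.

```r
# Depths and heights quoted in the text, converted to km.
trench_depth_km <- 10.994    # Mariana Trench
sediment_depth_km <- 4.5     # habitable subsurface sediment below the trench
everest_height_km <- 8.848   # Mount Everest

lower_limit_km <- trench_depth_km + sediment_depth_km
biosphere_span_km <- lower_limit_km + everest_height_km
round(c(lower = lower_limit_km, span = biosphere_span_km), 1)
```

Rounded to one decimal place this gives 15.5 km for the lower limit and 24.3 km for the total span, in line with the figures above.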
Annual cellular production, in cells/year x 10^29, was calculated with the following formula: Cells/year = Population Size x 365 days / turnover time (days), or Cells/year = Population Size x (turnovers/year)
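The formula above translates directly into a small R function. This is a minimal sketch: `annual_production` is a hypothetical helper name, and the example inputs are the population size and turnover time used for marine heterotrophs above 200 m elsewhere in this section.

```r
# Cells produced per year = population size x number of turnovers per year.
annual_production <- function(population_size, turnover_days) {
  stopifnot(turnover_days > 0)
  population_size * 365 / turnover_days
}

# Marine heterotrophs above 200 m: 3.6e28 cells, 16-day turnover time.
marine_production <- annual_production(3.6e28, 16)
signif(marine_production, 2)
```

Writing it as a function makes it easy to repeat the calculation for each habitat by changing only the population size and turnover time.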
Viruses are very important as modulators of metabolism but also community structure, abundance, and turnover time/rates. Viral proteins can impact the metabolic potential of prokaryotes.
Carbon content along with carbon assimilation efficiency determine the upper bound limit on the turnover rates seen in the upper 200 m of the ocean. This varies with depth in the ocean, and between terrestrial and marine habitats because the abundance of carbon in each habitat is different.
(1 year x 365 days/year x 24 hours/day) / ((4 x 10^-7)^4 mutations/cell x 8.2 x 10^29 cells/year) = 4 simultaneous mutations every 0.4 hours
Turnover time for marine heterotrophs above 200m: 16 days
365 days/16 = 22.8 turnover/year
(3.6 x 10^28 cells) x 22.8 turnover/year = 8.2 x 10^29 cells regenerated/year
mutation rate per gene per DNA replication = 4 x 10^-7
4 simultaneous mutations in every gene shared by population: (4 x 10^-7)^4 = 2.56 x 10^-26
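The steps above can be chained together in R to check the 0.4-hour figure. This is a sketch of the arithmetic only; it assumes, as the calculation above does, that the four mutations are independent so their probabilities multiply. The variable names are illustrative.

```r
mutation_rate <- 4e-7        # mutations per gene per replication
n_mutations <- 4             # simultaneous mutations in the same gene
cells_per_year <- 8.2e29     # annual production, upper 200 m of the ocean

# Probability that one replication carries all four mutations.
p_simultaneous <- mutation_rate^n_mutations

# Expected number of such cells produced per year.
events_per_year <- p_simultaneous * cells_per_year

# Average waiting time between such events, in hours.
hours_between_events <- (365 * 24) / events_per_year
round(hours_between_events, 1)
```

With these inputs there are roughly 21,000 four-fold mutants per year, or about one every 0.4 hours.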
A large mutation rate means that there is a great potential for multiple point mutations in a single replication. This allows for quick adaptation by creating a more diverse pool of mutants to be selected from. Genetic diversity will be extremely high when small scale changes to sequence are considered. Long term “species” level biodiversity will mostly be determined by competition and environmental pressures. Horizontal gene transfer can allow new genes to proliferate in a microbial community assuming the gene is successful in the organism it is “born” in.
High abundance allows for high diversity by increasing the potential for mutations and simultaneous mutations. Metabolic potential is dependent on both abundance and diversity. Diversity determines the pool of available genes to be used in metabolic pathways and abundance determines the magnitude of the effect of these pathways.
Geophysical (abiotic): tectonic and atmospheric photochemical processes, which are based on acid/base chemistry.
Biochemical (biotic): microbial processes, which are based on redox reactions.
Abiotic processes rely on acid/base chemistry (the transfer of protons without electrons) while biotic processes rely on redox reactions (the transfer of electrons and protons). Therefore, the biotic processes are mostly responsible for the transformation of energy while abiotic processes are mostly responsible for matter transformation. Some biotic processes can replenish substrates that are essential for life which otherwise would be depleted as abiotic processes come to thermodynamic equilibrium.
The earth’s redox state is considered an emergent property because it depends on both geochemical processes and microbial metabolic processes, which are both dynamic and always changing.
Reversible electron transfer reactions give rise to element and nutrient cycles because of the thermodynamic conditions that make each reaction favorable. Specifically, the rate at which each reaction occurs is determined by the conditions of the environment (i.e. abundance of substrates, products, etc.), thus allowing for nutrients to cycle in a stable manner. In this manner, the synergistic cooperation of different microbial communities can lead to different environments favorable to such reversible electron transfer reactions.
The different stages of the nitrogen cycle require different groups of microbes and different levels of oxygen, and the cycle is therefore partitioned between different redox “niches”.
During nitrogen fixation, nitrogen gas from the atmosphere is fixed to ammonium. This process occurs in surface waters where atmospheric nitrogen gas is in contact with water and therefore occurs in aerobic environments. Interestingly, the enzyme nitrogenase, which catalyzes this reaction, is inhibited by oxygen. Yet, nitrogen-fixing microbes have evolved a way to do so in aerobic environments.
Since oxygen is required to oxidize ammonium, nitrification occurs in aerobic environments and is a two-stage pathway. The two steps of nitrification (ammonia to nitrite, and nitrite to nitrate) are catalyzed by two different groups of microbes: ammonia-oxidizing bacteria/archaea and other nitrifying bacteria. These oxidation reactions are coupled to carbon fixation in many cases.
Denitrification, the anaerobic reduction of nitrate and nitrite to nitrogen gas, occurs in anaerobic environments and is coupled to oxidation of organic matter.
This shows how modular the nitrogen cycle is and how different redox “niches” are interconnected. In this connection, chemical species are exchanged between groups of microbes located in different redox niches and are utilized for different purposes.
Indirectly, the nitrogen cycle is connected to climate change. All microbes require nitrogen to synthesize protein and nucleic acids, and the only method of nitrogen fixation is via microorganisms. The nitrogen cycle is what controls the amount of available fixed nitrogen, which in turn affects the number of microbes carrying out various other reactions, which in turn produces the Earth’s atmosphere. The nitrogen cycle is also directly connected to climate change, as nitrous oxide can be produced during denitrification. Nitrous oxide is known as a greenhouse gas.
Although there is enormous genetic diversity in nature, there remains a relatively stable set of core genes coding for the major redox reactions essential for life and biogeochemical cycles. Thus, microbial diversity does not necessarily entail diversity in proteins involved in metabolism.
It is hypothesized that there is limitless evolutionary diversity in nature. The rate of discovery of unique protein families has been proportional to the sampling effort, with the number of new protein families increasing approximately linearly with the number of new genomes sequenced.
Microbes are considered guardians of metabolism on a temporary and simultaneous basis. This is because microbial evolution through horizontal and vertical gene transfer can change which phenotype is dominant at a given time. A dominant phenotype protects the metabolic pathway in the environment. If it does not survive environmental perturbations (which apply selective pressures on pathway genes), it will disappear. Humans could possibly replicate the individual pathways, but the overall biogeochemical processes that control the flow of electrons can only be carried out by microbes.
Falkowski, P. G., Fenchel, T., & Delong, E. F. (2008). The Microbial Engines That Drive Earth's Biogeochemical Cycles. Science, 320(5879), 1034-1039. doi: 10.1126/science.1153213
Nisbet, E. G., & Sleep, N. H. (2001). The habitat and nature of early life. Nature, 409(6823), 1083-1091. doi: 10.1038/35059210
Rockström, J., Steffen, W., Noone, K., Persson, Å, Chapin, F. S., Lambin, E. F., . . . Foley, J. A. (2009). A safe operating space for humanity. Nature, 461(7263), 472-475. doi: 10.1038/461472a
Whitman, W. B., Coleman, D. C., & Wiebe, W. J. (1998). Prokaryotes: The unseen majority. Proceedings of the National Academy of Sciences, 95(12), 6578-6583. doi: 10.1073/pnas.95.12.6578
Discuss the relationship between microbial community structure and metabolic diversity
Evaluate common methods for studying the diversity of microbial communities
Recognize basic design elements in metagenomic workflows
The authors of this paper wanted to get a more detailed characterization of the genetics and biochemistry of the proteorhodopsin (PR)-based photosystem in marine picoplankton communities of the photic zone.
What genes generate a fully functional PR photosystem? Do these genes allow photophosphorylation when cells are exposed to light (are they sufficient for photophosphorylation)? Why is the PR system ubiquitous in marine prokaryotes?
Screening: The authors used a large-insert DNA fosmid library prepared from ocean surface water picoplankton and screened for PR clones on retinal-containing LB agar medium. PR-containing clones were selected based on whether or not they showed an orange or red phenotype (which is expected for PR-containing clones growing on this type of medium).
Sequencing: The authors used a collection of transposon-insertion clones to allow for rapid DNA sequencing and localization of insertion mutants. Insertion mutants allow for phenotypic analysis of gene function. The authors also used a fosmid system in which the copy number can be increased.
Phenotypic analysis was performed using selective, but also differential, media.
Light-activated proton translocation was assessed by measuring light-dependent pH fluctuations. Photophosphorylation was measured using a luciferase-based assay.
Six genes are sufficient to generate a fully functional PR photosystem in E. coli and therefore in ocean surface water picoplankton. Out of these 6 genes, 5 encode accessory photopigment biosynthetic proteins while one encodes PR. This 6-gene system allows for light-activated proton translocation and for photophosphorylation (ATP synthesis) in cells. Given its size, a single horizontal gene transfer event can lead to acquired phototrophic capabilities and could explain the ubiquitous PR system in microbial communities of the ocean surface waters.
Can this method of identifying fosmid library clones be applied to other gene sets, and could other biomarkers or phenotypes be used to select for specific clones?
Are there other genetic systems that can be easily acquired by microbial populations in the environment?
Since PR photosystems have different UV absorption minima and maxima, would the type of PR acquired by a certain microbe be based on its environment, and therefore decrease its ability to thrive in other light environments?
Would more organisms contain a PR photosystem since it is so easily transferable (more than 13% of bacteria in marine picoplankton populations?)?
The authors could have elaborated on the methods currently used to identify and characterize functional genes, and on how the method they used for PR is revolutionary. The methods were described adequately, but some background information, such as more detail on the fosmid library construction and other techniques/approaches, was missing, which made the experimental logic harder to follow. More figures representing the library screening or copy-control mechanisms could have helped to explain some of the methods, but most figures were helpful. The conclusions are justified, although more studies would have to be done to confirm how easily PR genes transfer between organisms and whether this is an easily acquired capability.
As of 2014, 13 540 prokaryotic species with formal scientific names were included in the NCBI taxonomy database ( Federhen 2014 ). Out of these, 13 029 are part of the bacteria domain and 511 of the archaea domain. This number does not include species without formal names and has most likely increased over the past year or so. Indeed, the number of “validly named” bacterial and archaeal species is expected to greatly surpass 13 000 by the end of 2016 ( Amann and Rosselló-Móra 2016 ).
To touch more on prokaryotic “divisions”: according to a recent study by Solden et al. ( 2016 ), 89 bacterial phyla and 20 archaeal phyla had been identified by small subunit rRNA by 2016. However, as many as 1 500 bacterial phyla are thought to exist. In this context, “divisions” are considered as phylum-level lineages which are the highest level grouping.
According to a study by Hugenholtz ( 2002 ), there is a major bias towards the “big four” bacterial phyla since they are culturable in a lab environment. Numbers show that ~63% of phylum-level lineages have cultured representatives while ~37% do not. It is widely accepted that more than 99% of prokaryotes in an environmental sample are not culturable in a lab environment. Therefore, these numbers definitely show a bias towards culturable microorganisms. Note that “non-culturable” is a subjective term since improvements in growth media and in our understanding of unique prokaryotic environments could allow these organisms to grow in a lab environment. A slightly more recent study showed that out of 52 identifiable major lineages, only 26 had cultured representatives ( Rappé and Giovannoni 2003 ). With the recent advance of sequencing technology, this bias must have decreased and the numbers may show a more realistic proportion of unculturable vs. culturable prokaryotes in online databases.
As per EBI Metagenomics:
As of today, 110 217 metagenomic data sets/projects exist on the EBI Metagenomics website. Most data sets are public but a few subsets are private. These project metagenomic sequences originate from a multitude of environments: soil, human, engineered, marine, freshwater, mammals, plants, grassland, etc.
As per the DOE Joint Genome Institute - IMG:
As of today, 9 877 metagenome sequencing projects are currently available in the DOE JGI public domain. These projects are sourced from air, aquatic and terrestrial environments. There are also non-environmental project sources from host-associated samples and engineered environments.
Shotgun metagenomics:
Data warehousing: IMG/M, MG-RAST, NCBI. These are databases with different levels of curated datasets.
Assembly: EULER-SR
Binning: S-GSOM
Annotation: KEGG
Analysis pipelines: Megan 5
Marker gene metagenomics:
Standalone software: OTUbase
Analysis pipelines: SILVA (gold standard for rRNA identification)
Denoising: Amplicon Noise
Databases: Ribosomal Database Project (RDP, gold standard for rRNA identification)
Phylogenetic gene anchors are slow evolving marker genes containing variable regions, such as 16S rRNA/rDNA, that can be used to taxonomically identify microorganisms. For example, the 16S rRNA gene can be used in metagenome analysis to identify metabolically active prokaryotes in a select sample. In addition, only the phylogenetic gene anchor can be sequenced to provide an estimate of diversity and community composition. A phylogenetic tree can then be built based on how different these variable regions are between microorganisms ( Krause et al. 2008 ).
Functional gene anchors are similar to phylogenetic gene anchors but represent certain metabolic pathways present in certain members of the community. These markers can be used to track the metabolic potential of a community, but can also be used to build a phylogenetic tree inside a specific subgroup of a community with the same metabolic function and see how its members are related ( The New Science of Metagenomics 2007 ).
Metagenomic sequence binning is the process by which sequence data is associated to its original OTU. It groups sequence reads together that come or are thought to come from the same genome. A typical contamination threshold used in this process is 5% (meaning that 5% of sequence reads inside a bin do not belong to this OTU/bin).
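To make the contamination threshold concrete, here is a toy R sketch. It assumes we somehow know the true genome of origin for every read in a bin, which is only the case for simulated or benchmark data; `bin_contamination` and the genome labels are hypothetical names used for illustration.

```r
# Fraction of reads in a bin that do not come from the genome the bin
# is supposed to represent.
bin_contamination <- function(read_origin, bin_genome) {
  mean(read_origin != bin_genome)
}

# Toy bin: 95 reads from genome "A" and 5 stray reads from genome "B".
toy_bin <- c(rep("A", 95), rep("B", 5))
contamination <- bin_contamination(toy_bin, bin_genome = "A")

# Compare against the 5% threshold mentioned above.
passes_threshold <- contamination <= 0.05
c(contamination = contamination, passes = passes_threshold)
```

In real data the origin of each read is unknown, so contamination has to be estimated indirectly rather than counted as it is here.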
Types of algorithmic approaches used to produce sequence bins include composition-based binning (TETRA, GSOM, S-GSOM) and phylogenetic binning (or similarity-based binning; MEGAN, CARMA). Composition-based binning uses characteristics such as GC content, k-mer content and codon usage, while phylogenetic binning uses reference sequences to assign each sequence to its correct bin or OTU.
Composition-based binning is useful since it does not require reference sequences. However, it may be problematic when OTUs are closer in identity and more numerous in a sample. Phylogenetic binning works well when sequences in the sample have similarities to reference sequences but is not adequate for identifying organisms that have never been sequenced, since they lack reference sequences. Risks associated with metagenomic binning include grouping reads that have similar characteristics (based on k-mer content, codon usage, etc.) but do not originate from the same cell. Defining what a species is and setting thresholds is a crucial part of this process.
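The composition features mentioned above (GC content and k-mer content) can be computed with base R alone. The following is a minimal sketch of the feature extraction step only, not of any of the binning tools named above; the function names and the two toy reads are made up for illustration.

```r
# Fraction of G and C bases in a sequence.
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Counts of every overlapping k-mer in a sequence.
kmer_counts <- function(seq, k = 4) {
  seq <- toupper(seq)
  stopifnot(nchar(seq) >= k)
  starts <- seq_len(nchar(seq) - k + 1)
  table(substring(seq, starts, starts + k - 1))
}

# Two toy reads with clearly different composition.
reads <- c(read1 = "ATGCGCGCTA", read2 = "ATATATTTAA")
gc_by_read <- sapply(reads, gc_content)
gc_by_read

dimer_counts <- kmer_counts("ATATAT", k = 2)
dimer_counts
```

A composition-based binner would compute features like these for every read or contig and then cluster sequences with similar profiles; real tools typically use tetranucleotide (k = 4) frequencies on much longer sequences.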
Sequencing of the phylogenetic gene anchor 16S rRNA can be done to understand the metabolic activity of the microbes in the community. This method, however, only provides the community composition and not the metabolic potential. On this note, the same type of sequencing can be done with functional gene anchors to understand what metabolic pathways are present in a sample community.
Universal primers need to be selected prior to the experiment for amplifying and sequencing these gene anchors. Risks associated with this method include the selection of inappropriate primers: if microorganisms in a sample differ greatly from the best-known species, this experiment will result in many OTUs or functional gene anchors not being sequenced, and therefore not being identified. In addition, horizontal gene transfer and differing gene copy numbers among organisms can also alter results.
Other alternatives: FISH probing, functional screens (biochemical, etc.), 3rd generation sequencing ( Nanopore ), single cell sequencing (Single-cell amplified genomes, SAGs).
Amann, R., & Rosselló-Móra, R. (2016). After All, Only Millions? MBio, 7(4). doi: 10.1128/mbio.00999-16
Federhen, S. (2014). Type material in the NCBI Taxonomy Database. Nucleic Acids Research, 43(D1). doi: 10.1093/nar/gku1127
Hugenholtz, P. (2002). Exploring prokaryotic diversity in the genomic era. Genome Biol, 3(2). doi: 10.1186/gb-2002-3-2-reviews0003
Krause, L., Diaz, N. N., Goesmann, A., Kelley, S., Nattkemper, T. W., Rohwer, F., . . . Stoye, J. (2008). Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36(7), 2230-2239. doi: 10.1093/nar/gkn038
Madsen, E. L. (2005). Identifying microorganisms responsible for ecologically significant biogeochemical processes. Nature Reviews Microbiology, 3(5), 439-446. doi: 10.1038/nrmicro1151
Martinez, A., Bradley, A. S., Waldbauer, J. R., Summons, R. E., & Delong, E. F. (2007). Proteorhodopsin photosystem gene expression enables photophosphorylation in a heterologous host. Proceedings of the National Academy of Sciences, 104(13), 5590-5595. doi: 10.1073/pnas.0611470104
Rappé, M. S., & Giovannoni, S. J. (2003). The Uncultured Microbial Majority. Annual Review of Microbiology, 57(1), 369-394. doi: 10.1146/annurev.micro.57.030502.090759
Solden, L., Lloyd, K., & Wrighton, K. (2016). The bright side of microbial dark matter: lessons learned from the uncultivated majority. Current Opinion in Microbiology, 31, 217-226. doi: 10.1016/j.mib.2016.04.020
The New Science of Metagenomics. (2007). doi: 10.17226/11902
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A Primer on Metagenomics. PLoS Computational Biology, 6(2). doi: 10.1371/journal.pcbi.1000667
Obtain a collection of “microbial” cells from “seawater”. The cells were concentrated from different depth intervals by a marine microbiologist travelling along the Line-P transect in the northeast subarctic Pacific Ocean off the coast of Vancouver Island, British Columbia.
Sort out and identify different microbial “species” based on shared properties or traits. Record your data in this Rmarkdown using the example data as a guide.
Once you have defined your binning criteria, separate the cells using the sampling bags provided. These operational taxonomic units (OTUs) will be considered separate “species”. This problem set is based on content available at What is Biodiversity.
library(kableExtra)
library(knitr)
library(tidyverse)
fullcommunity = read.csv("fullcommunity.csv")
fullcommunity %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", font_size = 10, full_width = F)
| number | name | characteristics | occurences |
|---|---|---|---|
| 1 | orangebear | orange gummy bear | 14 |
| 2 | pinkbear | pink gummy bear | 16 |
| 3 | yellowbear | yellow gummy bear | 16 |
| 4 | whitebear | white gummy bear | 16 |
| 5 | redbear | red gummy bear | 10 |
| 6 | green bear | green gummy bear | 18 |
| 7 | pinksolidbear | solid pink gummy bears | 1 |
| 8 | redsolidbear | solid red gummy bears | 1 |
| 9 | jellybeanyellow | yellow jelly bean | 26 |
| 10 | jellybeanorange | orange jelly bean | 31 |
| 11 | jellybeanpink | pink jelly bean | 37 |
| 12 | jellybeangreen | green jelly bean | 33 |
| 13 | jellybeanred | red jellybean | 38 |
| 14 | jellyrodsyellow | long yellow jelly rods | 3 |
| 15 | jellyrodsorange | long orange jelly rods | 1 |
| 16 | jellyrodsred | long red jelly rods | 2 |
| 17 | jellyrodsblack | long black jelly rods | 1 |
| 18 | redswirl | red swirl | 1 |
| 19 | blueswirl | blue swirl | 2 |
| 20 | ovalyellow | yellow oval | 1 |
| 21 | redfilamentous | red, long string-like | 5 |
| 22 | orangespider | orange spider-shaped | 2 |
| 23 | pinkspider | pink spider-shape | 3 |
| 24 | purplespider | purple spider-shape | 1 |
| 25 | coke-bottlered | red coke bottle | 1 |
| 26 | cokebottlepink | pink coke bottle | 1 |
| 27 | cokebottleyellow | yellow coke bottle | 1 |
| 28 | kissessilver | silver kisses | 15 |
| 29 | bigbluebrick | large blue brick | 1 |
| 30 | bigyellowbrick | large yellow brick | 1 |
| 31 | bigpinkbrick | large pink brick | 1 |
| 32 | smallbluebrick | small blue brick | 3 |
| 33 | small yellow brick | small yellow brick | 3 |
| 34 | smallgreenbrick | small green brick | 2 |
| 35 | smallpinkbrick | small pink brick | 6 |
| 36 | biggreen | large round green jelly | 5 |
| 37 | bigpurple | large round purple jelly | 3 |
| 38 | bigred | large round red jelly | 7 |
| 39 | bigorange | large round orange jelly | 5 |
| 40 | bigyellow | large round yellow jelly | 3 |
| 41 | m&morange | orange m&m | 59 |
| 42 | m&mgreen | green m&m | 30 |
| 43 | m&mred | red m&m | 27 |
| 44 | m&myellow | yellow m&m | 34 |
| 45 | m&mbrown | brown m&m | 30 |
| 46 | m&mblue | blue m&m | 60 |
| 47 | skittlesyellow | yellow skittle | 34 |
| 48 | skittlesorange | orange skittle | 39 |
| 49 | skittlered | red skittle | 30 |
| 50 | skittlebrown | brown skittle | 30 |
| 51 | skittlegreen | green skittle | 42 |
mycommunity = read.csv("mycommunity.csv")
mycommunity %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", font_size = 10, full_width = F)
| number | name | characteristics | occurences |
|---|---|---|---|
| 1 | orangebear | orange gummy bear | 1 |
| 2 | pinkbear | pink gummy bear | 2 |
| 3 | yellowbear | yellow gummy bear | 5 |
| 4 | whitebear | white gummy bear | 5 |
| 5 | greenbear | green gummy bear | 4 |
| 6 | jellyrodsyellow | long yellow jelly rods | 6 |
| 7 | jellyrodsorange | long orange jelly rods | 9 |
| 8 | jellyrodsred | long red jelly rods | 5 |
| 9 | redswirl | red swirl | 1 |
| 10 | blueswirl | blue swirl | 1 |
| 11 | ovalyellow | yellow oval | 1 |
| 12 | redfilamentous | red, long string-like | 2 |
| 13 | cokebottlepink | pink coke bottle | 1 |
| 14 | kissessilver | silver kisses | 7 |
| 15 | smallbluebrick | small blue brick | 1 |
| 16 | smallyellowbrick | small yellow brick | 1 |
| 17 | smallpinkbrick | small pink brick | 1 |
| 18 | biggreen | large round green jelly | 1 |
| 19 | bigpurple | large round purple jelly | 1 |
| 20 | bigred | large round red jelly | 1 |
| 21 | bigorange | large round orange jelly | 1 |
| 22 | m&morange | orange m&m | 8 |
| 23 | m&mgreen | green m&m | 9 |
| 24 | m&mred | red m&m | 3 |
| 25 | m&myellow | yellow m&m | 10 |
| 26 | m&mbrown | brown m&m | 7 |
| 27 | m&mblue | blue m&m | 8 |
| 28 | skittlesyellow | yellow skittle | 5 |
| 29 | skittlesorange | orange skittle | 9 |
| 30 | skittlered | red skittle | 5 |
| 31 | skittlebrown | brown skittle | 1 |
| 32 | skittlegreen | green skittle | 10 |
My collection of microbial cells from seawater does not fully represent the actual diversity of microorganisms inhabiting waters along the Line-P transect. Only 32 of the 51 species in the full community were identified in my sample.
library(ggplot2)
collector <- read.csv("collector.csv")
ggplot(collector, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Cumulative number of individuals classified", y = "Cumulative number of species observed")
The collector’s curve for my sample does not flatten out.
I can conclude from the shape of my collector’s curve that my sampling depth was insufficient. Since the curve does not flatten out, more species would most likely be found if more sampling were done. The collector’s curve and this conclusion agree with the difference between the number of species found in my sample and in the full community (32 and 51, respectively).
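For reference, the two columns of a collector’s curve can be derived from the order in which individuals were classified. The sketch below uses a short, made-up classification order (not my actual data) and names the columns `x` and `y` to match the plot code above.

```r
# Hypothetical order in which individuals were picked and classified
# (one label per individual; not the real sampling order).
classified <- c("m&mblue", "skittlegreen", "m&mblue", "kissessilver",
                "skittlegreen", "yellowbear", "m&mblue", "whitebear")

collector_sketch <- data.frame(
  x = seq_along(classified),            # cumulative number of individuals classified
  y = cumsum(!duplicated(classified))   # cumulative number of species observed
)
collector_sketch
```

`y` only increases when a label is seen for the first time, so the curve flattens once new individuals stop revealing new species.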
\(\frac{1}{D}\) where \(D = \sum p_i^2\)
\(p_i\) = the fractional abundance of the \(i^{th}\) species
The higher the value is, the greater the diversity. The maximum value is the number of species in the sample, which occurs when all species contain an equal number of individuals. Because the index reflects both the number of species present (richness) and the relative proportions of each species within a community (evenness), this metric is a diversity metric. Consider that two communities can have the same number of species (equal richness) but a skewed distribution in the proportion of each species (unequal evenness), which would result in different diversity values.
orangebear = 1/132
pinkbear = 2/132
yellowbear = 5/132
whitebear = 5/132
greenbear = 4/132
jellyrodsyellow = 6/132
jellyrodsorange = 9/132
jellyrodsred = 5/132
redswirl = 1/132
blueswirl = 1/132
ovalyellow = 1/132
redfilamentous = 2/132
cokebottlepink = 1/132
kissessilver = 7/132
smallbluebrick = 1/132
smallyellowbrick = 1/132
smallpinkbrick = 1/132
biggreen = 1/132
bigpurple = 1/132
bigred = 1/132
bigorange = 1/132
mnmorange = 8/132
mnmgreen = 9/132
mnmred = 3/132
mnmyellow= 10/132
mnmbrown = 7/132
mnmblue = 8/132
skittlesyellow = 5/132
skittlesorange = 9/132
skittlered = 5/132
skittlebrown = 1/132
skittlegreen = 10/132
1 / (orangebear^2 + pinkbear^2 + yellowbear^2 + whitebear^2 +
greenbear^2 + jellyrodsyellow^2 + jellyrodsorange^2 + jellyrodsred^2 + redswirl^2 + blueswirl^2 + ovalyellow^2 + redfilamentous^2 + cokebottlepink^2 + kissessilver^2 + smallbluebrick^2 + smallyellowbrick^2 + smallpinkbrick^2 + biggreen^2 + bigpurple^2 + bigred^2 + bigorange^2 + mnmorange^2 + mnmgreen^2 + mnmred^2 + mnmyellow^2 + mnmbrown^2 + mnmblue^2 + skittlesyellow^2 + skittlesorange^2 + skittlered^2 + skittlebrown^2 + skittlegreen^2)
## [1] 19.89041
Calculated on spreadsheet: 23.1744939
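The same calculation can be written more compactly by keeping the counts in a vector. This is a minimal base R sketch; `inv_simpson` is a hypothetical helper name, and the counts are the 32 occurrence values from my sample table.

```r
# Occurrence counts for the 32 species in my sample (same order as the table).
counts <- c(1, 2, 5, 5, 4, 6, 9, 5, 1, 1, 1, 2, 1, 7, 1, 1,
            1, 1, 1, 1, 1, 8, 9, 3, 10, 7, 8, 5, 9, 5, 1, 10)

# Simpson Reciprocal Index: 1 / sum(p_i^2), with p_i the fractional abundance.
inv_simpson <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

inv_simpson(counts)   # sample value, 19.89041 as computed above
```

Passing the `occurences` column of the full community to the same function would give the community value without typing each species by hand.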
Another way to calculate diversity is to estimate the number of species that are present in a sample based on the empirical data to give an upper boundary of the richness of a sample. Here, we use the Chao1 richness estimator.
\(S_{chao1} = S_{obs} + (\frac{a^2}{2b})\)
\(S_{obs}\) = total number of species observed
\(a\) = species observed once
\(b\) = species observed twice or more
\(S_{chao1}\) for my sample =
32 + 13^2/(2*19)
## [1] 36.44737
\(S_{chao1}\) for the full community =
51 + 13^2/(2*38)
## [1] 53.22368
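The counts of \(a\) and \(b\) can also be derived from the occurrence data instead of being tallied by hand. The sketch below follows the definition used in this problem set, where \(b\) is the number of species observed twice or more; the classic Chao1 estimator instead takes \(b\) as the number of species observed exactly twice. `chao1_as_defined` is a hypothetical helper name.

```r
# Occurrence counts for the 32 species in my sample (same order as the table).
counts <- c(1, 2, 5, 5, 4, 6, 9, 5, 1, 1, 1, 2, 1, 7, 1, 1,
            1, 1, 1, 1, 1, 8, 9, 3, 10, 7, 8, 5, 9, 5, 1, 10)

# Chao1 as defined in this problem set: S_obs + a^2 / (2 * b).
# Assumption: b = species seen twice or more (not the classic "exactly twice").
chao1_as_defined <- function(counts) {
  s_obs <- sum(counts > 0)    # species observed
  a <- sum(counts == 1)       # species observed once
  b <- sum(counts >= 2)       # species observed twice or more
  s_obs + a^2 / (2 * b)       # returns Inf if b is 0
}

chao1_as_defined(counts)      # 32 + 13^2 / (2 * 19), matching the sample value above
```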
library(tidyverse)
library(dplyr)
library(vegan)
mycommunity_diversity =
  mycommunity %>%
  select(name, occurences) %>%
  spread(name, occurences)
fullcommunity_diversity =
  fullcommunity %>%
  select(name, occurences) %>%
  spread(name, occurences)
mycommunity_diversity
## biggreen bigorange bigpurple bigred blueswirl cokebottlepink greenbear
## 1 1 1 1 1 1 1 4
## jellyrodsorange jellyrodsred jellyrodsyellow kissessilver m&mblue
## 1 9 5 6 7 8
## m&mbrown m&mgreen m&morange m&mred m&myellow orangebear ovalyellow
## 1 7 9 8 3 10 1 1
## pinkbear redfilamentous redswirl skittlebrown skittlegreen skittlered
## 1 2 2 1 1 10 5
## skittlesorange skittlesyellow smallbluebrick smallpinkbrick
## 1 9 5 1 1
## smallyellowbrick whitebear yellowbear
## 1 1 5 5
fullcommunity_diversity
## bigbluebrick biggreen bigorange bigpinkbrick bigpurple bigred bigyellow
## 1 1 5 5 1 3 7 3
## bigyellowbrick blueswirl coke-bottlered cokebottlepink cokebottleyellow
## 1 1 2 1 1 1
## green bear jellybeangreen jellybeanorange jellybeanpink jellybeanred
## 1 18 33 31 37 38
## jellybeanyellow jellyrodsblack jellyrodsorange jellyrodsred
## 1 26 1 1 2
## jellyrodsyellow kissessilver m&mblue m&mbrown m&mgreen m&morange m&mred
## 1 3 15 60 30 30 59 27
## m&myellow orangebear orangespider ovalyellow pinkbear pinksolidbear
## 1 34 14 2 1 16 1
## pinkspider purplespider redbear redfilamentous redsolidbear redswirl
## 1 3 1 10 5 1 1
## skittlebrown skittlegreen skittlered skittlesorange skittlesyellow
## 1 30 42 30 39 34
## small yellow brick smallbluebrick smallgreenbrick smallpinkbrick
## 1 3 3 2 6
## whitebear yellowbear
## 1 16 16
Simpson Reciprocal Index for my sample:
diversity(mycommunity_diversity, index="invsimpson")
## [1] 19.89041
Simpson Reciprocal Index for community:
diversity(fullcommunity_diversity, index="invsimpson")
## [1] 23.17449
The Simpson Reciprocal Index values obtained from the R function match the values calculated for my sample and for the community.
Chao1 richness estimate for my sample:
specpool(mycommunity_diversity)
## Species chao chao.se jack1 jack1.se jack2 boot boot.se n
## All 32 32 0 32 0 32 32 0 1
Chao1 richness estimate for community:
specpool(fullcommunity_diversity)
## Species chao chao.se jack1 jack1.se jack2 boot boot.se n
## All 51 51 0 51 0 51 51 0 1
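The values returned by the R function do not match my calculated Chao1 values for my sample or for the community. specpool() is designed for incidence data collected across several sampling units; here each data frame has a single row (n = 1), so the function simply returns the total number of species observed as its chao estimate (32 and 51). My calculated Chao1 values are therefore slightly higher for both my sample and the community. An abundance-based estimator such as vegan's estimateR() would be a closer comparison to the hand calculation.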
If you are stuck on some of these final questions, reading the Kunin et al. 2010 and Lundin et al. 2012 papers may provide helpful insights.
The measure of diversity, alpha diversity in this case, directly depends on the definition of species in my sample and community. Since both the Simpson Reciprocal Index and the Chao1 estimate are based, in different ways, on the total number of species observed, changing the definition of species will change the number of species observed and therefore the alpha diversity.
For example, many studies use a 97% identity threshold to define species. Raising this value to 99% would result in more species observed in a sample (overestimating diversity), while lowering the threshold to 95% would result in more sequences being treated as “identical” and therefore fewer species observed in a sample (underestimating diversity).
Binning or clustering data based on the GC content, codon usage or percent identity will most likely result in a different number of species observed in a sample since these methods are based on different sequence characteristics.
Inflated diversity has previously been associated with pyrosequencing errors. Sanger, a sequencing technology with a relatively low error rate, has shown a constant overestimation of diversity. Therefore, sequencing technologies with a higher base-calling/sequencing error rate, such as PacBio or Oxford Nanopore, will most likely result in many more species observed in a sample. This is due to the potential introduction of wrong bases in a given DNA sequence, which may in turn lead to the sequence falling below the 97% identity threshold and therefore being categorized as a different, new species in the sample.
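To make the effect of error rate concrete, the sketch below estimates how likely a read is to fall below an identity threshold purely because of sequencing errors. It is a simplified model with assumed numbers: errors are treated as independent substitutions with the same probability at every base, the read length of 1500 bp and the two error rates are hypothetical, and `prob_below_threshold` is an illustrative helper name.

```r
# Probability that a read accumulates more mismatches than the identity
# threshold allows, assuming independent per-base errors (binomial model).
prob_below_threshold <- function(read_length, error_rate, identity = 0.97) {
  # Largest number of mismatches still at or above the identity threshold.
  max_mismatches <- floor(round((1 - identity) * read_length, 8))
  pbinom(max_mismatches, size = read_length, prob = error_rate,
         lower.tail = FALSE)            # P(mismatches > max_mismatches)
}

prob_below_threshold(1500, error_rate = 0.001)  # hypothetical low-error platform
prob_below_threshold(1500, error_rate = 0.10)   # hypothetical high-error platform
```

Under these assumptions, reads from the high-error case almost always drop below 97% identity to their true source, which is how errors alone can create apparent new species.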